Methods and apparatus to improve operation of a graphics processing unit

ABSTRACT

Methods, apparatus, systems, and articles of manufacture are disclosed to improve operation of a graphics processing unit (GPU). An example apparatus includes an instruction generator to insert profiling instructions into a GPU kernel to generate an instrumented GPU kernel, the instrumented GPU kernel is to be executed by a GPU, a trace analyzer to generate an occupancy map associated with the GPU executing the instrumented GPU kernel, a parameter calculator to determine one or more operating parameters of the GPU based on the occupancy map, and a processor optimizer to invoke a GPU driver to adjust a workload of the GPU based on the one or more operating parameters.

RELATED APPLICATION

This patent arises from a continuation of U.S. patent application Ser. No. 16/129,525 (now U.S. Pat. No. ______), which was filed on Sep. 12, 2018. U.S. patent application Ser. No. 16/129,525 is hereby incorporated herein by reference in its entirety. Priority to U.S. patent application Ser. No. 16/129,525 is hereby claimed.

FIELD OF THE DISCLOSURE

This disclosure relates generally to computers and, more particularly, to methods and apparatus to improve operation of a graphics processing unit (GPU).

BACKGROUND

Software developers seek to develop code that may be executed as efficiently as possible. To better understand code execution, profiling is used to measure different code execution statistics such as, for example, execution time, memory consumption, etc. In some examples, profiling is implemented by insertion of profiling instructions into the code. Such profiling instructions can be used to store and analyze information about the code execution.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example binary instrumentation engine inserting profiling instructions into a GPU kernel in accordance with teachings of this disclosure.

FIG. 2 depicts an example trace buffer generated in accordance with teachings of this disclosure.

FIG. 3 is a block diagram of the example binary instrumentation engine of FIG. 1 in accordance with teachings of this disclosure.

FIG. 4 depicts an example occupancy map generated in accordance with teachings of this disclosure.

FIG. 5 is a flowchart representative of machine readable instructions which may be executed to implement the example binary instrumentation engine of FIGS. 1 and 3 to improve operation of a GPU.

FIG. 6 is a flowchart representative of machine readable instructions which may be executed to implement the example binary instrumentation engine of FIGS. 1 and 3 to process the example trace buffer of FIG. 2 to generate the example occupancy map of FIG. 4.

FIG. 7 is a block diagram of an example processing platform structured to execute the instructions of FIGS. 5-6 to implement the example binary instrumentation engine of FIGS. 1 and 3.

The figures are not to scale. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.

DETAILED DESCRIPTION

A graphics processing unit (GPU) is an electronic circuit that executes instructions to modify contents of a buffer. Typically, the buffer is a frame buffer that is used to output information to a display device (e.g., a monitor, a touchscreen, etc.). Recently, GPUs have been used for tasks that are not necessarily related to generating output images.

GPUs execute instruction packages commonly referred to as kernels, compute kernels, and/or shaders. Typically, the term shader is used when a kernel is used for graphics-related tasks such as, for example, DirectX, Open Graphics Library (OpenGL) tasks, pixel shader/shading tasks, vertex shader/shading tasks, etc. The term kernel is used for general purpose computational tasks such as, for example, Open Computing Language (OpenCL) tasks, C for Media tasks, etc. While example approaches disclosed herein use the term kernel, such approaches are equally well suited to be used on shaders. Such kernels roughly correspond to an inner loop of a program that is iterated multiple times. As used herein, a GPU kernel refers to a kernel in binary format. A GPU programmer develops kernels/shaders in a high-level programming language such as, for example, a High-Level Shader Language (HLSL), OpenCL, etc., and then compiles the code into a binary version of the kernel, which is then executed by a GPU. Example approaches disclosed herein are applied to the binary version of the kernel.

Developers want to create the most computationally efficient kernels to perform their desired task. To gain a better understanding of the performance of a kernel, developers use a profiler and/or profiling system to collect operational statistics (e.g., performance statistics) of the kernel. Profilers insert additional instructions into the kernel to collect such operational statistics. However, prior profilers and/or profiling systems are used to determine occupancy of a central processing unit (CPU). Prior profilers and/or profiling systems determine the occupancy of the CPU because an operating system running on the CPU provides visibility of the CPU utilization for each of the cores and threads of the CPU. However, GPUs do not have an operating system running on the GPUs and, therefore, do not have an ability to measure busy and idle time intervals at the granularity of the execution units and hardware threads of the GPUs.

Examples disclosed herein improve operation of a GPU by measuring operating parameters of the GPU and determining whether to adjust operation of the GPU based on the measured operating parameters. In some disclosed examples, one or more processors included in a central processing unit (CPU) determine one or more operating parameters (e.g., operational statistics, performance statistics, etc.) associated with the GPU including at least one of a busy time parameter, an idle time parameter, an occupancy time parameter, or a utilization parameter. As used herein, a busy time of the GPU refers to a time interval, a time duration, etc., when a hardware thread of the GPU is busy executing a computational task. As used herein, an idle time of the GPU refers to a time interval, a time duration, etc., when a hardware thread of the GPU is not executing a computational task. As used herein, an occupancy of the GPU refers to a set of busy and/or idle time intervals associated with an execution unit and/or hardware thread of the GPU during execution of one or more computational tasks. As used herein, utilization of the GPU refers to a ratio of the busy time and a total time associated with the execution of the one or more computational tasks.
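
As an illustrative calculation (hypothetical numbers, provided only to make the definitions concrete): if a hardware thread of the GPU is busy for 6 ms out of a 10 ms measurement interval, its busy time is 6 ms, its idle time is 4 ms, and its utilization is 6 ms / 10 ms = 60%.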

In some disclosed examples, the CPU inserts additional instructions into kernels to collect information corresponding to the one or more operating parameters associated with the kernels. Additional instructions may include profiling instructions to instruct the GPU to record and/or otherwise store timestamps associated with a start time, an end time, etc., of an execution of the kernel. For example, when the GPU executes a kernel that includes the additional instructions, the GPU may store a start time associated with starting an execution of the kernel and an end time associated with ending the execution of the kernel. The GPU may store the timestamps and a corresponding hardware thread identifier in a trace buffer in memory. In such examples, the CPU may obtain the trace buffer and determine the one or more operating parameters based on information included in the trace buffer. In some disclosed examples, the CPU can determine that the GPU can execute additional computational tasks, fewer additional tasks, etc., based on the one or more operating parameters and, thus, improve operation of the GPU, scheduling operations of the CPU, etc.

FIG. 1 is a block diagram illustrating an example binary instrumentation engine 100 inserting example profiling instructions 102 into a first example GPU kernel 104 to generate a second example GPU kernel 106 to be executed by an example GPU 108. The second GPU kernel 106 is an instrumented GPU kernel. The GPU 108 may use the profiling instructions 102 to generate example profile data 110. The profile data 110 corresponds to data generated by the GPU 108 in response to executing the profiling instructions 102 included in the second kernel 106. The binary instrumentation engine 100 may obtain and analyze the profile data 110 to better understand the execution of the second kernel 106 by the GPU 108. The binary instrumentation engine 100 may determine to adjust operation of the GPU 108 based on analyzing the profile data 110.

In some examples, the profiling instructions 102 create and/or store operational information such as, for example, counters, timestamps, etc., that can be used to better understand the execution of a kernel. For example, the profiling instructions 102 may profile and/or otherwise characterize an execution of the second kernel 106 by the GPU 108. In some examples, the profiling instructions 102 are inserted at a first address (e.g., a first position) of a kernel (e.g., the beginning of the first kernel 104) to initialize variables used for profiling. In some examples, the profiling instructions 102 are inserted at locations intermediate the original instructions (e.g., intermediate the instructions from the first kernel 104). In some examples, the profiling instructions 102 are inserted at a second address (e.g., a second position) of the kernel (e.g., after the instructions from the first kernel 104) and, when executed, cause the GPU 108 to collect and/or otherwise store metrics that are accessible by the binary instrumentation engine 100. In some examples, the profiling instructions 102 are inserted at the end of the kernel (e.g., the first kernel 104) to perform cleanup (e.g., freeing memory locations, etc.). However, such profiling instructions 102 may additionally or alternatively be inserted at any location or position and in any order.
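
The following sketch illustrates, in simplified host-side pseudocode, how profiling instructions may be placed around the original instructions of a kernel. The list-based kernel representation and the instruction names are hypothetical and are used only for explanation; they do not describe an actual GPU binary format.

    # Illustrative sketch only: a kernel is modeled as a list of binary
    # instructions (hypothetical representation).
    def instrument_kernel(original_instructions):
        prologue = ["INIT_PROFILING_VARS"]       # inserted at the first address
        epilogue = ["STORE_METRICS", "CLEANUP"]  # inserted after the original code
        # Intermediate probes could additionally be interleaved among the
        # original instructions at locations of interest.
        return prologue + list(original_instructions) + epilogue

    instrumented = instrument_kernel(["OP_A", "OP_B", "OP_C"])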

In the illustrated example of FIG. 1, an example CPU 112 includes the binary instrumentation engine 100, an example application 114, an example GPU driver 116, and an example GPU compiler 118. The application 114 may be used to display an output from the GPU 108 when the GPU 108 executes graphics-related tasks such as, for example, DirectX tasks, OpenGL tasks, pixel shader/shading tasks, vertex shader/shading tasks, etc. Additionally or alternatively, the application 114 may be used to display and/or otherwise process outputs from the GPU 108 when the GPU 108 executes non-graphics related tasks. Additionally or alternatively, the application 114 may be used by a GPU programmer to facilitate development of kernels/shaders in a high-level programming language such as, for example, HLSL, OpenCL, etc.

In FIG. 1, the application 114 transmits tasks (e.g., computational tasks, graphics-related tasks, non-graphics related tasks, etc.) to the GPU driver 116. The GPU driver 116 receives the tasks and instructs the GPU compiler 118 to compile code associated with the tasks into a binary version (e.g., a binary format corresponding to binary code, binary instructions, machine readable instructions, etc.) to generate the first kernel 104. The GPU compiler 118 transmits the compiled binary version of the first kernel 104 to the GPU driver 116.

The binary instrumentation engine 100 of FIG. 1 obtains the first kernel 104 (e.g., in a binary format) from the GPU driver 116. The binary instrumentation engine 100 instruments the first kernel 104 by inserting additional instructions such as the profiling instructions 102 into the first kernel 104. As used herein, an instrumented kernel refers to a kernel that includes profiling and/or tracing instructions to be executed to measure statistics or monitor an execution of the kernel. For example, the binary instrumentation engine 100 may modify the first kernel 104 to create an instrumented GPU kernel such as the second kernel 106. That is, the binary instrumentation engine 100 creates the second kernel 106 without executing any compilation of the GPU kernel. In this manner, already-compiled GPU kernels can be instrumented and/or profiled. The second kernel 106 is passed to the GPU 108 via example memory 120. For example, the binary instrumentation engine 100 may transmit the second kernel 106 to the GPU driver 116, which, in turn, stores the second kernel 106 in the memory 120 for retrieval by the GPU 108.

The GPU 108 uses the profiling instructions 102 of FIG. 1 to generate the profile data 110. In FIG. 1, the profiling instructions 102 include a first example instruction 102 a of “A=RDTSC” inserted at a first position, where the first instruction 102 a corresponds to a read (RD) operation of a register (e.g., a hardware register) associated with a time-stamp counter (TSC) and a store operation of a first value of the register in a variable A. The profiling instructions 102 include a second example instruction 102 b of “B=RDTSC” inserted at a second position, where the second instruction 102 b corresponds to reading the register associated with the TSC and storing a second value of the register in a variable B. The profiling instructions 102 include a third example instruction 102 c of “Trace (A, B, HW-thread-ID)” at a third position, where the third instruction 102 c corresponds to generating a trace and storing the variables A, B, and an identifier (ID) of a hardware (HW) thread (HW-THREAD-ID) in the trace. For example, the trace may refer to a sequence of data records that are written (e.g., dynamically written) into a memory buffer (referred to herein as a trace buffer).

In FIG. 1, the HW-THREAD-ID corresponds to a hardware thread that executed the second kernel 106 including example GPU instructions 122 disposed between the first instruction 102 a and the second instruction 102 b. In response to executing the profiling instructions 102 and the GPU instructions 122, the GPU 108 stores the trace that includes information included in the variables A, B, and HW-THREAD-ID in an example trace buffer 124 included in the profile data 110. The trace buffer 124 includes example records 126. For example, a first one of the records 126 in FIG. 1 is [A₁, B₁, 7], where A₁ corresponds to a first timestamp, B₁ corresponds to a second timestamp, and 7 corresponds to a hardware thread identifier, where the second timestamp is after the first timestamp. The first timestamp (A₁) of the first one of the records 126 may correspond to when a hardware thread with a hardware thread identifier of 7 begins executing the instrumented GPU kernel 106. The second timestamp (B₁) of the first one of the records 126 may correspond to when the hardware thread with the hardware thread identifier of 7 concludes executing the instrumented GPU kernel 106.
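
The logical effect of the profiling instructions 102 may be sketched as follows. This is a host-language approximation for explanation only; an actual GPU kernel executes binary instructions, and the timestamp source and trace-append primitive shown here are stand-ins for the TSC read and trace write described above.

    import time

    trace_buffer = []  # stands in for the trace buffer 124 in the memory 120

    def run_instrumented_kernel(hw_thread_id, original_work):
        a = time.perf_counter_ns()   # A = RDTSC (first timestamp)
        original_work()              # original GPU instructions 122
        b = time.perf_counter_ns()   # B = RDTSC (second timestamp)
        trace_buffer.append((a, b, hw_thread_id))  # Trace(A, B, HW-thread-ID)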

In the illustrated example of FIG. 1, the memory 120 includes one or more kernels such as the second kernel 106, the profile data 110, and example GPU data 128. Alternatively, the memory 120 may not store one or more kernels. The data 128 corresponds to data generated by the GPU 108 in response to executing at least the second kernel 106. For example, the data 128 may correspond to graphics-related data, output information to a display device, etc.

The profile data 110 includes the trace buffer 124, which is an example implementation of an example trace buffer 200 depicted in the illustrated example of FIG. 2. The trace buffer 200 of FIG. 2 represents an example format that may be used by the GPU 108 to generate the trace buffer 124 of FIG. 1. In FIG. 2, the trace buffer 200 is a buffer that includes a plurality of example records 202. In FIG. 2, the records 202 may correspond to the records 126 of FIG. 1. For example, a first one of the records 202 of FIG. 2 may correspond to the first one of the records 126 of FIG. 1. Each of the records 202 includes example data fields (e.g., data entries) 204, 206, 208 including a first example data field 204, a second example data field 206, and a third example data field 208. Alternatively, one or more of the records 202 may include fewer or more data fields than depicted in FIG. 2. In FIG. 2, the first data field 204 is a first data storage unit that stores a first value of a timestamp counter (A) associated with a hardware thread executing the second kernel 106. The second data field 206 is a second data storage unit that stores a second value of the timestamp counter (B), where the second value is greater than the first value. For example, the first value may correspond to a first time and the second value may correspond to a second time, where the second time is after or later than the first time. In FIG. 2, the third data field 208 is a third data storage unit that stores an identifier of the hardware thread (THREAD ID).
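
As one illustration of how such records might be read back on the CPU side, the sketch below assumes, purely for the example, that each record is stored as three consecutive unsigned 64-bit little-endian integers (first timestamp, second timestamp, hardware thread identifier); the actual record layout is implementation specific.

    import struct

    RECORD_FORMAT = "<QQQ"  # first timestamp (A), second timestamp (B), thread ID
    RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

    def parse_trace_buffer(raw_bytes):
        records = []
        for offset in range(0, len(raw_bytes), RECORD_SIZE):
            a, b, thread_id = struct.unpack_from(RECORD_FORMAT, raw_bytes, offset)
            records.append((a, b, thread_id))
        return records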

In the illustrated example of FIG. 2, the trace buffer 200 is generated in an atomic manner. For example, the GPU 108 may generate the trace buffer 200 sequentially, where a first one of the records 202 is adjacent to a second one of the records 202 and the first one of the records 202 is generated prior to the second one of the records 202. The GPU 108 generates the records 202 from different hardware threads that are intermixed in the trace buffer 200. For example, the trace buffer 200 may not be stored in chronological order, in order of hardware thread identifier, etc. However, two records k and m having the same hardware thread identifier have the following characteristic: if k<m, then Aₖ<Bₖ<Aₘ<Bₘ.
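
This property can be verified with a simple check over the parsed records (a sketch that reuses the record tuples from the parsing example above; it assumes the records appear in the buffer in the order in which they were appended):

    def check_same_thread_ordering(records):
        # For records written by the same hardware thread, an earlier record
        # must end before a later record begins: if k < m, then
        # A_k < B_k < A_m < B_m.
        last_end = {}
        for a, b, thread_id in records:
            if a >= b:
                return False
            if thread_id in last_end and last_end[thread_id] >= a:
                return False
            last_end[thread_id] = b
        return True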

Turning back to FIG. 1, the binary instrumentation engine 100 retrieves (e.g., iteratively retrieves, periodically retrieves, etc.) the trace buffer 124 from the memory 120. In some examples, the binary instrumentation engine 100 determines one or more operating parameters associated with the second kernel 106, and/or, more generally, the GPU 108. For example, the binary instrumentation engine 100 may determine a busy time parameter, an idle time parameter, an occupancy time parameter, and/or a utilization parameter. In some examples, the binary instrumentation engine 100 adjusts operation of the GPU 108 based on the one or more operating parameters. For example, the binary instrumentation engine 100 may instruct the CPU 112 to schedule an increased quantity of instructions to be performed by the GPU 108, a decreased quantity of instructions to be performed by the GPU 108, etc., based on the one or more operating parameters.

FIG. 3 is a block diagram of the binary instrumentation engine 100 of FIG. 1 to improve operation of the GPU 108 of FIG. 1. The binary instrumentation engine 100 instruments binary shaders/kernels prior to sending them to the GPU 108. The binary instrumentation engine 100 collects traces including timestamps associated with when the instrumented code is executed by the GPU 108. The binary instrumentation engine 100 generates an occupancy map and/or one or more operating parameters based on the collected traces, where the occupancy map and/or the one or more operating parameters may be used to improve operation of the GPU 108, the CPU 112, etc. In the illustrated example of FIG. 3, the binary instrumentation engine 100 includes an example instruction generator 300, an example trace analyzer 310, an example parameter calculator 320, and an example processor optimizer 330.

In the illustrated example of FIG. 3, the binary instrumentation engine 100 includes the instruction generator 300 to instrument kernels such as the first kernel 104 of FIG. 1. For example, the instruction generator 300 may access the first kernel 104 (e.g., access the first kernel 104 from memory included in the CPU 112). The instruction generator 300 may instrument the first kernel 104 to generate the second kernel 106 of FIG. 1. For example, the instruction generator 300 may generate and insert binary code associated with the profiling instructions 102 of FIG. 1 into the first kernel 104 to generate the second kernel 106. The instruction generator 300 includes means to generate binary code (e.g., binary instructions, machine readable instructions, etc.) based on the profiling instructions 102. The instruction generator 300 includes means to insert the generated binary code into the first kernel 104 at one or more places or positions within the first kernel 104 to generate the second kernel 106.

In the illustrated example of FIG. 3, the binary instrumentation engine 100 includes the trace analyzer 310 to retrieve and/or otherwise collect the profile data 110 from the memory 120 of FIG. 1. The trace analyzer 310 includes means to extract the trace buffer 124 from the profile data 110. The trace analyzer 310 processes the trace buffer 124 by traversing the trace buffer 124 from a first position (e.g., a beginning) of the trace buffer 124 to a second position (e.g., an end) of the trace buffer 124. For example, a first one of the records 202 of FIG. 2 at the first position may have a lower hardware thread ID compared to a second one of the records 202 at the second position. In other examples, the first one of the records 202 at the first position may have lower timestamps compared to the second one of the records 202 at the second position.

In some examples, the trace analyzer 310 includes means to group the records 202 into one or more sub-traces based on the hardware thread identifiers. For example, the trace analyzer 310 may sort and/or otherwise organize the records 202 into subsets or groups having the same hardware thread ID. In such examples, the trace analyzer 310 may generate new indices for ones of the records 202 that have the same hardware thread ID. For example, for two records k and m having the same hardware thread identifier where k<m, the trace analyzer 310 may assign a new index of k′ to the record k and a new index of m′ to the record m. For example, if a first one of the records 202 has an index of 24 (e.g., Record 24) and a hardware thread identifier of 234 and a second one of the records 202 has an index of 37 (e.g., Record 37) and the hardware thread identifier of 234, the trace analyzer 310 may assign an index of 0 to the first one of the records 202 and an index of 1 to the second one of the records 202.

In some examples, the trace analyzer 310 traverses each of the sub-traces from ones of the records 202 having the lower indices to the ones of the records 202 having the higher indices. The trace analyzer 310 may generate a timeline (e.g., an occupancy timeline) associated with each of the records 202 in the sub-traces. For example, the trace analyzer 310 may select a first one of the records 202 in a sub-trace of interest, where the first one of the records 202 has timestamps represented by [A,B], where A refers to the first data field 204 and B refers to the second data field 206 of FIG. 2. The trace analyzer 310 may determine that a time interval spanning time A to time B is busy, whereas the time outside of the time interval is idle. The trace analyzer 310 may generate (e.g., iteratively generate) timelines for each of the records 202 in one or more sub-traces of interest. The trace analyzer 310 may generate an occupancy map such as an example occupancy map 400 depicted in FIG. 4 based on the one or more timelines.
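
A compact sketch of this grouping and interval mapping is shown below. The dictionary-based occupancy map is an illustrative data structure chosen for clarity; the occupancy map 400 of FIG. 4 is described later in connection with that figure.

    from collections import defaultdict

    def build_occupancy_map(records):
        # Group records into sub-traces keyed by hardware thread identifier.
        # Within each sub-trace the records keep their buffer order, which
        # effectively assigns them new per-thread indices 0, 1, 2, ...
        sub_traces = defaultdict(list)
        for a, b, thread_id in records:
            sub_traces[thread_id].append((a, b))
        # Each [A, B] pair is a busy interval on that thread's timeline;
        # time outside the intervals is treated as idle.
        return dict(sub_traces)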

In the illustrated example of FIG. 3, the binary instrumentation engine 100 includes the parameter calculator 320 to determine one or more operating parameters associated with the GPU 108 of FIG. 1. In some examples, the parameter calculator 320 includes means to determine a busy time parameter, an idle time parameter, an occupancy time parameter, and/or a utilization parameter associated with the GPU 108. In some examples, the parameter calculator 320 determines the one or more operating parameters based on the occupancy map 400 depicted in FIG. 4. For example, the parameter calculator 320 may determine a busy time parameter for a hardware thread by determining a quantity of time that the hardware thread is busy during a time period. In other examples, the parameter calculator 320 may calculate an idle time parameter for the hardware thread by determining a quantity of time that the hardware thread is idle during the time period. In yet other examples, the parameter calculator 320 may determine a utilization parameter by calculating a ratio of the busy time parameter and a total quantity of time associated with a time duration of interest.

In some examples, the parameter calculator 320 determines aggregate operating parameters that are based on a quantity of hardware threads. For example, the parameter calculator 320 may calculate an aggregate utilization parameter by calculating a ratio of a quantity of busy hardware threads to a total quantity of hardware threads for a time duration or time period of interest.
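
Given an occupancy map of busy intervals per hardware thread, the operating parameters described above may be computed roughly as follows (an illustrative sketch; the measurement window boundaries and the total thread count are supplied by the caller and are assumptions of the example):

    def thread_utilization(busy_intervals, window_start, window_end):
        # Busy time is the portion of the [A, B] intervals inside the window;
        # idle time is the remainder; utilization is busy time over total time.
        busy = sum(min(b, window_end) - max(a, window_start)
                   for a, b in busy_intervals
                   if b > window_start and a < window_end)
        total = window_end - window_start
        return busy, total - busy, busy / total

    def aggregate_utilization(occupancy_map, total_thread_count, t0, t1):
        # Ratio of hardware threads busy at any point in [t0, t1] to the
        # total quantity of hardware threads of the GPU.
        busy_threads = sum(1 for intervals in occupancy_map.values()
                           if any(b > t0 and a < t1 for a, b in intervals))
        return busy_threads / total_thread_count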

In the illustrated example of FIG. 3, the binary instrumentation engine 100 includes the processor optimizer 330 to adjust operation of the CPU 112 and/or the GPU 108 based on the occupancy map, the one or more operating parameters, etc. In some examples, the processor optimizer 330 transmits the one or more operating parameters to the application 114 of FIG. 1. For example, the processor optimizer 330 may report and/or otherwise communicate a hardware thread utilization, an execution unit utilization, etc., associated with the GPU 108 to developers (e.g., software developers, processor designers, GPU engineers, etc.) with a performance analysis tool, a graphical user interface included in the performance analysis tool, etc. In such examples, the developers may improve their software by improving, for example, load balance of computational tasks, provisioning different data distribution among hardware threads, execution units, etc., of the GPU 108, etc.

In some examples, the processor optimizer 330 includes means to improve and/or otherwise optimize resource scheduling (e.g., hardware scheduling, memory allocation, etc.) by the CPU 112. For example, developers may develop and/or improve hardware scheduling functions or mechanisms by analyzing the one or more operating parameters associated with the GPU 108. In other examples, the processor optimizer 330 invokes hardware, software, firmware, and/or any combination of hardware, software, and/or firmware (e.g., the GPU driver 116, the CPU 112, etc.) to improve operation of the GPU 108. For example, the processor optimizer 330 may generate and transmit an instruction (e.g., a command, machine readable instructions, etc.) to the GPU driver 116, the CPU 112, etc., of FIG. 1. In response to receiving and/or otherwise executing the instruction, the GPU driver 116, the CPU 112, etc., is invoked to determine whether to adjust an operation of the GPU 108. For example, the GPU driver 116, and/or, more generally, the CPU 112 may be called to adjust scheduling of computational tasks, jobs, workloads, etc., to be executed by the GPU 108.

In some examples, the processor optimizer 330 invokes the GPU driver 116 to analyze one or more operating parameters based on an occupancy map. For example, the GPU driver 116 (or the CPU 112) may compare an operating parameter to an operating parameter threshold (e.g., a busy threshold, an idle threshold, a utilization threshold, etc.). For example, when invoked, the GPU driver 116 (or the CPU 112) may determine that a utilization of the GPU 108 is 95%, corresponding to the GPU 108 being busy 95% of a measured time interval. The GPU driver 116 may compare the utilization of 95% to a utilization threshold of 80% and determine that the GPU 108 should not accept more computational tasks based on the utilization satisfying the utilization threshold (e.g., the utilization is greater than the utilization threshold). As used herein, a job or a workload may refer to a set of one or more computational tasks to be executed by one or more hardware threads.

In other examples, when invoked by the processor optimizer 330, the GPU driver 116 (or the CPU 112) may determine that a utilization of the GPU 108 is 40%. The GPU driver 116 may compare the utilization of 40% to the utilization threshold of 80% and determine that the GPU 108 has available bandwidth to execute more computational tasks. For example, the GPU driver 116 may determine that the utilization of 40% does not satisfy the utilization threshold of 80%. In response to determining that the utilization of the GPU 108 does not satisfy the utilization threshold, the GPU driver 116 may adjust or modify a schedule of resources to facilitate tasks to be executed by the GPU 108. For example, the GPU driver 116 may increase a quantity of computational tasks that the GPU 108 is currently executing and/or will be executing based on the utilization parameter.
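
A simplified sketch of such a threshold comparison is shown below. The 80% threshold matches the example above, while the scheduling callback is a hypothetical placeholder rather than a description of an actual driver interface.

    UTILIZATION_THRESHOLD = 0.80  # e.g., 80%

    def maybe_schedule_more_work(utilization, schedule_more_tasks):
        # If the utilization satisfies (exceeds) the threshold, the GPU should
        # not accept more computational tasks; otherwise it has available
        # bandwidth and additional tasks may be scheduled.
        if utilization > UTILIZATION_THRESHOLD:
            return False
        schedule_more_tasks()
        return True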

While an example manner of implementing the binary instrumentation engine 100 of FIG. 1 is illustrated in FIG. 3, one or more of the elements, processes, and/or devices illustrated in FIG. 3 may be combined, divided, re-arranged, omitted, eliminated, and/or implemented in any other way. Further, the example instruction generator 300, the example trace analyzer 310, the example parameter calculator 320, the example processor optimizer 330, and/or, more generally, the example binary instrumentation engine 100 of FIG. 1 may be implemented by hardware, software, firmware, and/or any combination of hardware, software, and/or firmware. Thus, for example, any of the example instruction generator 300, the example trace analyzer 310, the example parameter calculator 320, the example processor optimizer 330, and/or, more generally, the example binary instrumentation engine 100 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), programmable controller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example instruction generator 300, the example trace analyzer 310, the example parameter calculator 320, and/or the example processor optimizer 330 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc., including the software and/or firmware. Further still, the example binary instrumentation engine 100 of FIG. 1 may include one or more elements, processes, and/or devices in addition to, or instead of, those illustrated in FIG. 3, and/or may include more than one of any or all of the illustrated elements, processes, and devices. As used herein, the phrase “in communication,” including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.

FIG. 4 depicts an example occupancy map 400 generated by the binary instrumentation engine 100 of FIGS. 1 and 3. For example, the trace analyzer 310 of FIG. 3 may generate the occupancy map 400 based on one or more sub-traces included in the trace buffer 200 processed by the trace analyzer 310 of FIG. 3. In FIG. 4, the binary instrumentation engine 100 organized the records 202 into example sub-traces 402, 404, 406, 408, 410 including a first example sub-trace 402, a second example sub-trace 404, a third example sub-trace 406, a fourth example sub-trace 408, and a fifth example sub-trace 410. For example, a sub-trace may refer to a sequence of one or more records corresponding to the same hardware thread identifier.

In the illustrated example of FIG. 4, the first, third, and fourth sub-traces 402, 406, 408 each include one of the records 202. In FIG. 4, the second and fifth sub-traces 404, 410 each include two of the records 202. For example, a first one and a second one of the records 202 included in the second sub-trace 404 have the same hardware thread ID of 1. Alternatively, the first through fifth sub-traces 402, 404, 406, 408, 410 may have a different number of the records 202.

In FIG. 4, the binary instrumentation engine 100 generates the occupancy map 400 by processing the records 202 included in the sub-traces 402, 404, 406, 408, 410. For example, the trace analyzer 310 may map the one or more records 202 included in the sub-traces 402, 404, 406, 408, 410 to an example time interval (e.g., a timeline, an occupancy timeline, a time duration, etc.) 412 of the occupancy map 400. For example, the trace analyzer 310 may map the record 202 of the first sub-trace 402 to the timeline 412 of the occupancy map defined by [A,B], where A corresponds to a first timestamp of hardware thread ID 1 and B corresponds to a second timestamp of the hardware thread ID 1, where the second timestamp is after the first timestamp. The time duration spanning from the first timestamp until the second timestamp corresponds to the timeline 412. For example, the trace analyzer 310 may map timelines associated with the records 202 (e.g., the timeline 412) to generate the occupancy map 400, where the timelines represent time durations during which the corresponding hardware threads are busy. In FIG. 4, the timeline 412 has a starting point at a first position corresponding to the first timestamp and has an end point at a second position corresponding to the second timestamp. The trace analyzer 310 represents, denotes, marks, etc., the time interval between the starting point and the end point as busy (e.g., represented in FIG. 4 as a rectangle) and represents the time interval outside of the starting point and the end point as idle (e.g., represented by empty space).

In some examples, the trace analyzer 310 updates (e.g., iteratively updates, continuously updates, etc.) the occupancy map 400 based on (continuously) obtaining and (continuously) processing the trace buffer 200. In some examples, the parameter calculator 320 generates the one or more operating parameters based on the occupancy map 400. For example, the parameter calculator 320 may determine a utilization of hardware thread identifier 0 included in the GPU 108 by calculating a ratio of a busy time of the hardware thread identifier 0 with respect to a measured time period. In other examples, the parameter calculator 320 may determine an aggregate utilization of the GPU 108 by calculating a ratio of a first quantity of hardware threads that are busy and a second quantity of total hardware threads of the GPU 108 for a measured time period.

Flowcharts representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the binary instrumentation engine 100 of FIGS. 1 and 3 are shown in FIGS. 5-6. The machine readable instructions may be an executable program or portion of an executable program for execution by a computer processor such as the processor 712 shown in the example processor platform 700 discussed below in connection with FIG. 7. The program may be embodied in software stored on a non-transitory computer readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor 712, but the entire program and/or parts thereof could alternatively be executed by a device other than the processor 712 and/or embodied in firmware or dedicated hardware. Further, although the example program is described with reference to the flowcharts illustrated in FIGS. 5-6, many other methods of implementing the example binary instrumentation engine 100 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware.

As mentioned above, the example processes of FIGS. 5-6 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory, and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the terms “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.

FIG. 5 is a flowchart representative of example machine readable instructions 500 which may be executed to implement the binary instrumentation engine 100 of FIGS. 1 and 3-4 to improve operation of the GPU 108 of FIG. 1. The machine readable instructions 500 begin at block 502, at which the binary instrumentation engine 100 generates binary instructions to be included in a kernel to be executed by a GPU. For example, the instruction generator 300 (FIG. 3) may instrument the first kernel 104 of FIG. 1 by generating binary instructions corresponding to the profiling instructions 102 of FIG. 1 and inserting the binary instructions into the first kernel 104 to generate the second kernel 106 of FIG. 1.

At block 504, the binary instrumentation engine 100 instructs a GPU driver to transmit the kernel including the binary instructions to the GPU for execution. For example, the instruction generator 300 may transmit the second kernel 106 to the GPU driver 116 and instruct the GPU driver 116 to store the second kernel 106 in the memory 120. The GPU 108 may retrieve the second kernel 106 from the memory 120 and execute the second kernel 106.

At block 506, the binary instrumentation engine 100 obtains a trace buffer associated with executing the kernel. For example, the trace analyzer 310 may retrieve the trace buffer 124 of FIG. 1 or the trace buffer 200 of FIG. 2 from the memory 120.

At block 508, the binary instrumentation engine 100 processes the trace buffer to generate an occupancy map. For example, the trace analyzer 310 (FIG. 3) may sort and/or otherwise organize the records 202 of FIG. 2 into one or more sub-traces such as the sub-traces 402, 404, 406, 408, 410 of FIG. 4. In such examples, the trace analyzer 310 may map ones of the records 202 included in the sub-traces 402, 404, 406, 408, 410 to timelines to generate the occupancy map 400 of FIG. 4. An example process that may be used to implement block 508 is described below in connection with FIG. 6.

At block 510, the binary instrumentation engine 100 determines operating parameter(s) of the GPU. For example, the parameter calculator 320 (FIG. 3) may determine one or more operating parameters such as a busy time parameter, an idle time parameter, an occupancy time parameter, and/or a utilization parameter associated with the GPU 108 executing the second kernel 106. In some examples, the parameter calculator 320 determines the one or more operating parameters based on the information included in the occupancy map 400 of FIG. 4 such as the timeline 412.

At block 512, the CPU 112 (FIG. 1) determines whether to adjust a workload of the GPU based on the operating parameter(s). For example, the processor optimizer 330 (FIG. 3) may invoke the GPU driver 116 (FIG. 1) to compare a value of an operating parameter to an operating parameter threshold and determine whether the value satisfies the operating parameter threshold based on the comparison. For example, the GPU driver 116 may compare a utilization of 50% of the GPU 108 to a utilization threshold of 75% and determine that the utilization of 50% does not satisfy the utilization threshold of 75% based on the utilization of 50% being less than the utilization threshold of 75%. In such examples, the GPU driver 116 may determine to adjust and/or otherwise modify the workload of the GPU 108 based on the utilization of the GPU 108 not satisfying the utilization threshold. For example, the GPU driver 116 may adjust the workload of the GPU 108 by increasing a quantity of computational tasks to be executed by the GPU 108.

If, at block 512, the CPU 112 determines not to adjust the workload of the GPU based on the operating parameter(s), control proceeds to block 516 to determine whether to generate additional binary instructions. If, at block 512, the CPU 112 determines to adjust the workload of the GPU based on the operating parameter(s), then, at block 514, the binary instrumentation engine 100 invokes the GPU driver to adjust the workload of the GPU. For example, the processor optimizer 330 may generate a command, an instruction, etc., to invoke the GPU driver 116 to adjust the workload of the GPU 108. For example, the GPU driver 116, and/or, more generally, the CPU 112 may determine to increase a quantity of computational tasks to be executed by the GPU 108 when invoked by the instruction generated by the processor optimizer 330.

At block 516, the binary instrumentation engine 100 determines whether to generate additional binary instructions. For example, the instruction generator 300 may determine to instrument another kernel different from the first kernel 104. If, at block 516, the binary instrumentation engine 100 determines to generate additional binary instructions, control returns to block 502 to generate binary instructions to be included in another kernel to be executed by the GPU.

If, at block 516, the binary instrumentation engine 100 determines not to generate additional binary instructions, then, at block 518, the binary instrumentation engine 100 determines whether to continue monitoring the GPU. For example, the trace analyzer 310 may determine to continue retrieving the trace buffer 124 either asynchronously or synchronously.

If, at block 518, the binary instrumentation engine 100 determines to continue monitoring the GPU, control returns to block 506 to obtain the trace buffer associated with executing the kernel; otherwise, the machine readable instructions 500 of FIG. 5 conclude.

FIG. 6 is a flowchart representative of the machine readable instructions 508 which may be executed to implement the example binary instrumentation engine 100 of FIGS. 1 and 3-4 to process the trace buffer 124 of FIG. 1 or the trace buffer 200 of FIG. 2 to generate the occupancy map 400 of FIG. 4. The machine readable instructions 508 begin at block 602, at which the binary instrumentation engine 100 groups records into sub-traces based on hardware thread identifier. For example, the trace analyzer 310 (FIG. 3) may organize the records 202 of FIG. 2 included in the trace buffer 200 based on hardware thread identifiers of the records 202 into the sub-traces 402, 404, 406, 408, 410 of FIG. 4.

At block 604, the binary instrumentation engine 100 selects a sub-trace of interest to process. For example, the trace analyzer 310 may select the second sub-trace 404 to process. At block 606, the binary instrumentation engine 100 determines whether the sub-trace has more than one record. For example, the trace analyzer 310 may determine that the second sub-trace 404 has two of the records 202, where a first one of the records 202 has a first index of 2 (Record 2) and a second one of the records 202 has a second index of 3 (Record 3).

If, at block 606, the binary instrumentation engine 100 determines that the sub-trace does not have more than one record, control proceeds to block 610 to select a record of interest to process. If, at block 606, the binary instrumentation engine 100 determines that the sub-trace has more than one record, then, at block 608, the binary instrumentation engine 100 assigns new indices to the records. For example, the trace analyzer 310 may assign an index of 1 to the first one of the records 202 included in the second sub-trace 404 and assign an index of 2 to the second one of the records 202 included in the second sub-trace 404.

At block 610, the binary instrumentation engine 100 selects a record of interest to process. For example, the trace analyzer 310 may select the first one of the records 202 included in the second sub-trace 404 to process. At block 612, the binary instrumentation engine 100 maps a time interval in the record to an occupancy map. For example, the trace analyzer 310 may map the time interval represented by [A,B] in the first one of the records 202 included in the second sub-trace 404 to the occupancy map 400. The trace analyzer 310 may designate the time interval from [A,B] as busy in the occupancy map 400 and designate the time interval outside of [A,B] as idle.

At block 614, the binary instrumentation engine 100 determines whether to select another record of interest to process. For example, the trace analyzer 310 may determine to select the second one of the records 202 included in the second sub-trace 404 to process.

If, at block 614, the binary instrumentation engine 100 determines to select another record of interest to process, control returns to block 610 to select another record of interest to process. If, at block 614, the binary instrumentation engine 100 determines not to select another record of interest to process, then, at block 616, the binary instrumentation engine 100 determines whether to select another sub-trace of interest to process. For example, the trace analyzer 310 may determine to select the third sub-trace 406 of the trace buffer 124 to process.

If, at block 616, the binary instrumentation engine 100 determines to select another sub-trace of interest to process, control returns to block 604 to select another sub-trace of interest to process. If, at block 616, the binary instrumentation engine 100 determines not to select another sub-trace of interest to process, control returns to block 510 of the machine readable instructions 500 of FIG. 5 to determine operating parameter(s) of the GPU.

FIG. 7 is a block diagram of an example processor platform 700 structured to execute the instructions of FIGS. 5-6 to implement the binary instrumentation engine of FIGS. 1 and 3-4. The processor platform 700 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a gaming console, a personal video recorder, a set top box, a headset or other wearable device, or any other type of computing device.

The processor platform 700 of the illustrated example includes a processor 712. The processor 712 of the illustrated example is hardware. For example, the processor 712 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In this example, the processor 712 implements the example instruction generator 300, the example trace analyzer 310, the example parameter calculator 320, and the example processor optimizer 330 of FIG. 3.

The processor 712 of the illustrated example includes a local memory 713 (e.g., a cache). The processor 712 of the illustrated example is in communication with a main memory including a volatile memory 714 and a non-volatile memory 716 via a bus 718. The volatile memory 714 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of random access memory device. The non-volatile memory 716 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 714, 716 is controlled by a memory controller.

The processor platform 700 of the illustrated example also includes an interface circuit 720. The interface circuit 720 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.

In the illustrated example, one or more input devices 722 are connected to the interface circuit 720. The input device(s) 722 permit(s) a user to enter data and/or commands into the processor 712. The input device(s) 722 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint device, and/or a voice recognition system.

One or more output devices 724 are also connected to the interface circuit 720 of the illustrated example. The output devices 724 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube display (CRT), an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or speaker. The interface circuit 720 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or a graphics driver processor.

The interface circuit 720 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 726. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.

The processor platform 700 of the illustrated example also includes one or more mass storage devices 728 for storing software and/or data. Examples of such mass storage devices 728 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.

The machine executable instructions 732 of FIGS. 5-6 may be stored in the mass storage device 728, in the volatile memory 714, in the non-volatile memory 716, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.

From the foregoing, it will be appreciated that example methods, apparatus, and articles of manufacture have been disclosed that improve operation of a processor, a graphics processing unit, etc. The disclosed methods, apparatus, and articles of manufacture improve the efficiency of using a computing device by adjusting a resource schedule based on available bandwidth of resources. By increasing a quantity of computational tasks to be executed by a GPU based on determining one or more operating parameters disclosed herein, the GPU may execute more computational tasks compared to prior systems. By determining the one or more parameters disclosed herein, developers can generate kernels that can be executed more quickly and more efficiently by GPUs compared to prior systems. The disclosed methods, apparatus, and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.

The following pertain to further examples disclosed herein.

Example 1 includes an apparatus to improve operation of a graphics processing unit (GPU), the apparatus comprising an instruction generator to insert profiling instructions into a GPU kernel to generate an instrumented GPU kernel, the instrumented GPU kernel is to be executed by a GPU, a trace analyzer to generate an occupancy map associated with the GPU executing the instrumented GPU kernel, a parameter calculator to determine one or more operating parameters of the GPU based on the occupancy map, and a processor optimizer to invoke hardware to adjust a workload of the GPU based on the one or more operating parameters.

Example 2 includes the apparatus of example 1, wherein the instruction generator is to insert the profiling instructions by inserting a first subset of the profiling instructions at a first address of the GPU kernel and inserting a second subset of the profiling instructions at a second address of the GPU kernel, the first address different from the second address.

Example 3 includes the apparatus of example 1, wherein the instrumented GPU kernel is to cause the GPU to generate a trace buffer including timestamps and hardware thread identifiers, the trace buffer including one or more records, the one or more records each including a first data field corresponding to a first timestamp included in the timestamps, a second data field corresponding to a second timestamp included in the timestamps, and a third data field corresponding to one of the hardware thread identifiers.

Example 4 includes the apparatus of example 1, wherein the trace analyzer is to generate the occupancy map by grouping one or more records of a trace buffer generated by the GPU into one or more sub-traces based on hardware thread identifiers included in the trace buffer, the one or more records having first indices, assigning second indices to the one or more records in the one or more sub-traces when the one or more sub-traces have more than one of the one or more records, the second indices different from the first indices, and mapping timelines associated with the one or more records to the occupancy map.

Example 5 includes the apparatus of example 4, wherein the trace analyzer is to map the timelines to the occupancy map by representing first time durations of the occupancy map corresponding to the timelines as busy and representing second time durations of the occupancy map as idle, the second time durations corresponding to time periods not included in the timelines.

Example 6 includes the apparatus of example 1, wherein the one or more operating parameters include at least one of a busy time parameter, an idle time parameter, an occupancy time parameter, or a utilization parameter.

Example 7 includes the apparatus of example 1, wherein the hardware is to adjust the workload of the GPU by comparing a first one of the one or more operating parameters to a threshold, determining whether to increase a quantity of computational tasks to be executed by the GPU based on the comparison, and increasing the quantity of computational tasks when the first one of the one or more parameters satisfies the threshold.

Example 8 includes a non-transitory computer readable medium comprising instructions which, when executed, cause a machine to at least insert profiling instructions into a GPU kernel to generate an instrumented GPU kernel, the instrumented GPU kernel is to be executed by a GPU, generate an occupancy map associated with the GPU executing the instrumented GPU kernel, determine one or more operating parameters of the GPU based on the occupancy map, and adjust a workload of the GPU based on the one or more operating parameters.

Example 9 includes the non-transitory computer readable medium of example 8, further including instructions which, when executed, cause the machine to at least insert a first subset of the profiling instructions at a first address of the GPU kernel and insert a second subset of the profiling instructions at a second address of the GPU kernel, the first address different from the second address.

Example 10 includes the non-transitory computer readable medium of example 8, wherein the instrumented GPU kernel is to cause the GPU to generate a trace buffer including timestamps and hardware thread identifiers, the trace buffer including one or more records, the one or more records each including a first data field corresponding to a first timestamp included in the timestamps, a second data field corresponding to a second timestamp included in the timestamps, and a third data field corresponding to one of the hardware thread identifiers.

Example 11 includes the non-transitory computer readable medium of example 8, further including instructions which, when executed, cause the machine to at least group one or more records of a trace buffer generated by the GPU into one or more sub-traces based on hardware thread identifiers included in the trace buffer, the one or more records having first indices, assign second indices to the one or more records in the one or more sub-traces when the one or more sub-traces have more than one of the one or more records, the second indices different from the first indices, and map timelines associated with the one or more records to the occupancy map.

Example 12 includes the non-transitory computer readable medium of example 11, further including instructions which, when executed, cause the machine to at least represent first time durations of the occupancy map corresponding to the timelines as busy and represent second time durations of the occupancy map as idle, the second time durations corresponding to time periods not included in the timelines.

Example 13 includes the non-transitory computer readable medium of example 8, wherein the one or more operating parameters include at least one of a busy time parameter, an idle time parameter, an occupancy time parameter, or a utilization parameter.

Example 14 includes the non-transitory computer readable medium of example 8, further including instructions which, when executed, cause the machine to at least compare a first one of the one or more operating parameters to a threshold, determine whether to increase a quantity of computational tasks to be executed by the GPU based on the comparison, and increase the quantity of computational tasks when the first one of the one or more parameters satisfies the threshold.

Example 15 includes an apparatus to improve operation of a graphics processing unit (GPU), the apparatus comprising means for inserting profiling instructions into a GPU kernel to generate an instrumented GPU kernel, the instrumented GPU kernel is to be executed by a GPU, means for generating an occupancy map associated with the GPU executing the instrumented GPU kernel, means for determining one or more operating parameters of the GPU based on the occupancy map, and means for adjusting a workload of the GPU based on the one or more operating parameters.

Example 16 includes the apparatus of example 15, wherein the means for inserting the profiling instructions is to insert a first subset of the profiling instructions at a first address of the GPU kernel and insert a second subset of the profiling instructions at a second address of the GPU kernel, the first address different from the second address.

Example 17 includes the apparatus of example 15, wherein the instrumented GPU kernel is to cause the GPU to generate a trace buffer including timestamps and hardware thread identifiers, the trace buffer including one or more records, the one or more records each including a first data field corresponding to a first timestamp included in the timestamps, a second data field corresponding to a second timestamp included in the timestamps, and a third data field corresponding to one of the hardware thread identifiers.

Example 18 includes the apparatus of example 15, wherein the means for generating the occupancy map is to group one or more records of a trace buffer generated by the GPU into one or more sub-traces based on hardware thread identifiers included in the trace buffer, the one or more records having first indices, assign second indices to the one or more records in the one or more sub-traces when the one or more sub-traces have more than one of the one or more records, the second indices different from the first indices, and map timelines associated with the one or more records to the occupancy map.

Example 19 includes the apparatus of example 18, wherein the means for generating the occupancy map is to map the timelines to the occupancy map by representing first time durations of the occupancy map corresponding to the timelines as busy and representing second time durations of the occupancy map as idle, the second time durations corresponding to time periods not included in the timelines.

Example 20 includes the apparatus of example 15, wherein the one or more operating parameters include at least one of a busy time parameter, an idle time parameter, an occupancy time parameter, or a utilization parameter.

Example 21 includes the apparatus of example 15, wherein the means for adjusting the workload of the GPU is to compare a first one of the one or more operating parameters to a threshold, determine whether to increase a quantity of computational tasks to be executed by the GPU based on the comparison, and increase the quantity of computational tasks when the first one of the one or more parameters satisfies the threshold.

Example 22 includes a method to improve operation of a graphics processing unit (GPU), the method comprising inserting profiling instructions into a GPU kernel to generate an instrumented GPU kernel, the instrumented GPU kernel is to be executed by a GPU, generating an occupancy map associated with the GPU executing the instrumented GPU kernel, determining one or more operating parameters of the GPU based on the occupancy map, and adjusting a workload of the GPU based on the one or more operating parameters.

Example 23 includes the method of example 22, wherein the instrumented GPU kernel is to cause the GPU to generate a trace buffer including timestamps and hardware thread identifiers, the trace buffer including one or more records, the one or more records each including a first data field corresponding to a first timestamp included in the timestamps, a second data field corresponding to a second timestamp included in the timestamps, and a third data field corresponding to one of the hardware thread identifiers.

Example 24 includes the method of example 22, further including grouping one or more records of a trace buffer generated by the GPU into one or more sub-traces based on hardware thread identifiers included in the trace buffer, the one or more records having first indices, assigning second indices to the one or more records in the one or more sub-traces when the one or more sub-traces have more than one of the one or more records, the second indices different from the first indices, and mapping timelines associated with the one or more records to the occupancy map.

Example 25 includes the method of example 22, further including comparing a first one of the one or more operating parameters to a threshold, determining whether to increase a quantity of computational tasks to be executed by the GPU based on the comparison, and increasing the quantity of computational tasks when the first one of the one or more parameters satisfies the threshold.

Although certain example methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent.

1. (canceled)
2. An apparatus to improve operation of a graphics processing unit (GPU), the apparatus comprising: instructions; and at least one processor to execute the instructions to: populate a trace buffer based on one or more records, corresponding ones of the records having a hardware thread identifier and a timestamp, the one or more records generated in response to an execution of an instrumented GPU kernel by the GPU; generate one or more sub-traces based on at least some of the hardware thread identifiers; determine one or more timelines associated with the timestamps of the one or more sub-traces; generate an occupancy map associated with the GPU based on the one or more timelines; and adjust a workload of the GPU based on the occupancy map.
3. The apparatus of claim 2, wherein the one or more records include a first record having a first index, the one or more sub-traces include a first sub-trace including the first record, and the at least one processor is to assign a second index to the first record in response to a determination the first sub-trace includes more than one of the one or more records.
4. The apparatus of claim 2, wherein the instructions are first instructions, and the at least one processor is to insert profiling instructions into a GPU kernel to generate the instrumented GPU kernel, the instrumented GPU kernel to be executed by a GPU.
5. The apparatus of claim 4, wherein the at least one processor is to insert the profiling instructions by inserting a first subset of the profiling instructions at a first address of the GPU kernel and inserting a second subset of the profiling instructions at a second address of the GPU kernel, the first address different from the second address.
6. The apparatus of claim 2, wherein the at least one processor is to map the one or more timelines to the occupancy map by representing first time durations of the occupancy map corresponding to the timelines as busy and representing second time durations of the occupancy map as idle, the second time durations corresponding to time periods not included in the timelines.
7. The apparatus of claim 2, wherein the at least one processor is to determine one or more operating parameters of the GPU based on the occupancy map, the workload of the GPU to be adjusted based on the one or more operating parameters.
8. The apparatus of claim 7, wherein the one or more operating parameters include at least one of a busy time parameter, an idle time parameter, an occupancy time parameter, or a utilization parameter.
9. A non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least: populate a trace buffer based on one or more records, corresponding ones of the records having a hardware thread identifier and a timestamp, the one or more records generated in response to an execution of an instrumented graphics processing unit (GPU) kernel by a GPU; generate one or more sub-traces based on at least some of the hardware thread identifiers; determine one or more timelines associated with the timestamps of the one or more sub-traces; generate an occupancy map associated with the GPU based on the one or more timelines; and adjust a workload of the GPU based on the occupancy map.
10. The non-transitory computer readable storage medium of claim 9, wherein the one or more records include a first record having a first index, the one or more sub-traces include a first sub-trace including the first record, and the instructions, when executed, cause the at least one processor to assign a second index to the first record in response to a determination the first sub-trace includes more than one of the one or more records.
11. The non-transitory computer readable storage medium of claim 9, wherein the instructions are first instructions and, when executed, the first instructions cause the at least one processor to insert profiling instructions into a GPU kernel to generate the instrumented GPU kernel, the instrumented GPU kernel to be executed by a GPU.
12. The non-transitory computer readable storage medium of claim 11, wherein the first instructions, when executed, cause the at least one processor to insert the profiling instructions by inserting a first subset of the profiling instructions at a first address of the GPU kernel and inserting a second subset of the profiling instructions at a second address of the GPU kernel, the first address different from the second address.
13. The non-transitory computer readable storage medium of claim 9, wherein the instructions, when executed, cause the at least one processor to map the one or more timelines to the occupancy map by representing first time durations of the occupancy map corresponding to the timelines as busy and representing second time durations of the occupancy map as idle, the second time durations corresponding to time periods not included in the timelines.
14. The non-transitory computer readable storage medium of claim 9, wherein the instructions, when executed, cause the at least one processor to determine one or more operating parameters of the GPU based on the occupancy map, the workload of the GPU to be adjusted based on the one or more operating parameters.
15. The non-transitory computer readable storage medium of claim 14, wherein the one or more operating parameters include at least one of a busy time parameter, an idle time parameter, an occupancy time parameter, or a utilization parameter.
16. A method to improve operation of a graphics processing unit (GPU), the method comprising: populating a trace buffer based on one or more records in response to an execution of an instrumented GPU kernel by the GPU, corresponding ones of the records having a hardware thread identifier and a timestamp; generating one or more sub-traces based on at least some of the hardware thread identifiers; determining one or more timelines associated with the timestamps of the one or more sub-traces; generating an occupancy map associated with the GPU based on the one or more timelines; and adjusting a workload of the GPU based on the occupancy map.
17. The method of claim 16, wherein the one or more records include a first record having a first index, the one or more sub-traces include a first sub-trace including the first record, and further including assigning a second index to the first record in response to a determination the first sub-trace includes more than one of the one or more records.
18. The method of claim 16, further including inserting profiling instructions into a GPU kernel to generate the instrumented GPU kernel.
19. The method of claim 18, further including inserting the profiling instructions by inserting a first subset of the profiling instructions at a first address of the GPU kernel and inserting a second subset of the profiling instructions at a second address of the GPU kernel, the first address different from the second address.
20. The method of claim 16, further including mapping the one or more timelines to the occupancy map by representing first time durations of the occupancy map corresponding to the timelines as busy and representing second time durations of the occupancy map as idle, the second time durations corresponding to time periods not included in the timelines.
21. The method of claim 16, further including determining one or more operating parameters of the GPU based on the occupancy map, the workload of the GPU to be adjusted based on the one or more operating parameters.